# Image-text generation

Gemma 3 12b It Quantized.w8a8

An INT8-quantized (W8A8: 8-bit weights and 8-bit activations) version of google/gemma-3-12b-it that accepts image and text input and produces text output, suited to efficient inference deployment.
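To make the efficiency claim concrete, here is a rough sketch of how weight storage scales with numeric format. It assumes a nominal 12 billion parameters (taken from the "12b" in the model name) and counts weights only; activations, the KV cache, and runtime overhead are ignored. The function name `weight_memory_gb` and the byte table are illustrative, not part of any library.

```python
# Approximate bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate weight storage in GB (10^9 bytes) for a parameter count and format."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    nominal_params = 12e9  # assumption: nominal size implied by "12b"
    for dtype in ("bf16", "int8"):
        print(f"{dtype}: ~{weight_memory_gb(nominal_params, dtype):.0f} GB of weights")
```

Under these assumptions, INT8 weights take about half the space of BF16 weights. Actual memory use depends on the runtime and on which layers are left unquantized.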

Xlangai Jedi 3B 1080p GGUF

Jedi-3B-1080p is a 3B-parameter model developed by xlangai, quantized using llama.cpp, suitable for image-text generation tasks.

Large Language Model English

Medgemma 4b It GGUF

medgemma-4b-it is a multimodal model focused on the medical field, capable of processing image and text inputs, and suitable for multiple medical scenarios such as radiology and clinical reasoning.

Internvl3 8B Hf

InternVL3 is an advanced multimodal large language model series with powerful multimodal perception and reasoning capabilities, supporting image, video, and text inputs.

Transformers Other

Internvl3 2B Hf

InternVL3-2B is a multimodal large language model implemented with the Hugging Face Transformers library. It handles image, video, and text inputs, and supports multiple input methods and efficient batch inference.

Transformers Other

Kimi VL A3B Thinking 8bit

Kimi-VL-A3B-Thinking-8bit is a multimodal vision-language model converted to the MLX format, supporting image-text-to-text generation tasks.

Transformers Other

Gemma 3 27b It Qat Bf16

Gemma 3 27B IT QAT BF16 is a model in Google's Gemma series. It has undergone quantization-aware training (QAT) and has been converted to the BF16 format for use with the MLX framework.

Gemma 3 12b It Qat Int4 Unquantized

Gemma 3 is a lightweight multimodal open model from Google that accepts text and image inputs and produces text output, with a 128K context window and multilingual support.

Gemma 3 4b It Int4 Awq

Gemma is a lightweight, advanced open model series from Google, built using the same research technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.

Smoldocling 256M Preview Mlx Fp16

This model is converted from ds4sd/SmolDocling-256M-preview to the MLX format, supporting image-text-to-text tasks.

Transformers English

Bytedance Research.ui TARS 72B SFT GGUF

A 72B-parameter multimodal foundation model released by ByteDance Research, specializing in image-text-to-text tasks.

Aya Vision 8B is an open-weight 8-billion-parameter multilingual vision-language model supporting visual and language tasks in 23 languages.

Transformers Supports Multiple Languages

Gemma is a lightweight open-source multimodal model series launched by Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.

Aria Sequential Mlp FP8 Dynamic

An FP8 dynamically quantized model based on Aria-sequential_mlp, suitable for image-text-to-text tasks and requiring approximately 30 GB of VRAM.

Florence 2 Flux Large

A vision-language model based on Microsoft Florence-2-large, designed for image understanding and text generation tasks.

Transformers Supports Multiple Languages

IDEFICS is an open-source multimodal model that processes image and text inputs to generate text outputs. It is an open-source reproduction of DeepMind's Flamingo model.

Transformers English

Blip2 Image To Text

BLIP-2 is a vision-language pre-trained model that bootstraps language-image pre-training from a frozen image encoder and a frozen large language model.

Transformers English

© 2025 AIbase